-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce memory usage in nested JSON parser - tree generation #11864
Reduce memory usage in nested JSON parser - tree generation #11864
Conversation
reduces memory usage by 35% (1GB json takes 10.951GB instead of 16.957GB)
reduce peak memory usage (not total memory used) reorder node_range, node_cat, scope limit token_levels 10.957 GiB -> 9.91 GiB -> 9.774 GiB -> 9.403 GiB
9.403 GiB to 8.487 GiB (for 1GB json input)
These NVTX RANGES macros might be useful. Commenting here for wider audience & as searchable reference. This These variables are not static because it can be used inside loop with loop index as label too. #define _CONCAT_(x, y) x##y
#define CONCAT(x, y) _CONCAT_(x, y)
#define NVTX3_PUSH_RANGE_IN(D, tag) \
::nvtx3::registered_message<D> const CONCAT(nvtx3_range_name__, __LINE__){std::string(tag)}; \
::nvtx3::event_attributes const CONCAT(nvtx3_range_attr__, \
__LINE__){CONCAT(nvtx3_range_name__, __LINE__)}; \
nvtxDomainRangePushEx(::nvtx3::domain::get<D>(), CONCAT(nvtx3_range_attr__, __LINE__).get());
#define NVTX3_POP_RANGE(D) nvtxDomainRangePop(::nvtx3::domain::get<D>());
#define CUDF_PUSH_RANGE(tag) NVTX3_PUSH_RANGE_IN(cudf::libcudf_domain, tag)
#define CUDF_POP_RANGE() NVTX3_POP_RANGE(cudf::libcudf_domain)
#define NVTX3_SCOPED_RANGE_IN(D, tag) \
::nvtx3::registered_message<D> const CONCAT(nvtx3_scope_name__, \
__LINE__){std::string(__func__) + "::" + tag}; \
::nvtx3::event_attributes const CONCAT(nvtx3_scope_attr__, \
__LINE__){CONCAT(nvtx3_scope_name__, __LINE__)}; \
::nvtx3::domain_thread_range<D> const CONCAT(nvtx3_range__, \
__LINE__){CONCAT(nvtx3_scope_attr__, __LINE__)};
#define CUDF_SCOPED_RANGE(tag) NVTX3_SCOPED_RANGE_IN(cudf::libcudf_domain, tag) Update:
|
This reverts commit 5eefd64.
Codecov ReportBase: 87.40% // Head: 88.11% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-22.12 #11864 +/- ##
================================================
+ Coverage 87.40% 88.11% +0.70%
================================================
Files 133 133
Lines 21833 21881 +48
================================================
+ Hits 19084 19281 +197
+ Misses 2749 2600 -149
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass
auto pid = first_childs_parent_token_id(tid); | ||
return pid < 0 | ||
? parent_node_sentinel | ||
: thrust::lower_bound(thrust::seq, node_ids_gpu, node_ids_gpu + num_nodes, pid) - |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will probably be a good point to do some precomputations (hashmap or sparse bitmap) if this kernel becomes a performance issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one takes ~20% of the parent_node_ids computation time. it take significant time, but not the critical one.
hash_map takes extra memory. Tried it now. memory increases from 7.97 GiB to 9.271 GiB. It's much slower than lower_bound; 12 ms lower_bound, vs 133 ms hash_map.
I am interested to learn about the "sparse bitmap" approach .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an approach I often use to speed up binary search-like lookups where the indices are unique. The fundamental idea is a bitvector with rank support:
You store a bitvector containing a 1 if the corresponding token is a node and 0 in groups of 32 bit words and compute an exclusive prefix sum over the popcount of the words. Computing the lower bound is then prefix_sum[i / 32] + popcnt(prefix_mask(i % 32) & bitvector[i / 32])
where prefix_mask(i) = (1u<< i) - 1
has all bits smaller than its parameter set. Overall, this uses 2 bits of storage per token. If the number of tokens is much larger than the number of nodes, you can make the data structure even sparser if you only store 32 bit words that are not all 0 (basically a reduce_by_key over the tokens) and use a normal bitvector to store a bit for each 32 bit word denoting whether it was non-zero. Then you have a two-level lookup (though the second level could also be another lookup data structure like a hashmap). The data structure has pretty good caching properties, since locality in indices translates to locality in memory, which hashmaps purposefully don't have.
But with the number of nodes being not smaller than the number of tokens by a huge factor, I don't think this would be worth the effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Nice work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review -- initial comments. I'll submit additional comments shortly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I finished my first pass of review (this is the second half). Comments attached.
Co-authored-by: Bradley Dice <[email protected]>
@gpucibot merge |
Description
Reduces Memory usage by 53% in nested JSON parser tree generation algorithm.
1GB JSON takes 8.469 GiB instead of 16.957 GiB. All values below are for 1 GB JSON text input.
This PR employs following optimisations to reduce memory usage
thrust::stable_sort_by_key
(9.403 GiB -> 8.487 GiB)cub::DoubleBuffer
which eliminates copy of order. (8.487 GiB -> 7.97 GiB)The peak memory is reduced by 53%, parsing bandwidth still remains same. (1.6 GB/s in GV100 for 1GB JSON).
Since
get_stack_context
in JSON parser takes highest memory usage (8.469 GB), peak memory is not influenced by JSON tree generation step anymore. Peak memory is now 50% of that of earlier code.Checklist